Data Ingestion
Overview
In this cookbook, you will access, process, analyze, and visualize satellite data in the context of example machine learning workflows. Machine learning tasks often involve a lot of data, and in Python, data is typically stored in memory as simple NumPy arrays. However, higher-level containers built on top of NumPy arrays provide more functionality for tabular data (Pandas), multidimensional gridded data (xarray), or out-of-core and distributed data (Dask).
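As a minimal illustration of these containers (using made-up values, not the Landsat data), the same small grid can be held as a bare NumPy array, a labeled Pandas DataFrame, or an xarray DataArray with named dimensions and coordinates:

```python
import numpy as np
import pandas as pd
import xarray as xr

values = np.array([[1.0, 2.0], [3.0, 4.0]])  # plain NumPy: positional indexing only

# Pandas adds labeled rows/columns for tabular data
df = pd.DataFrame(values, index=['row0', 'row1'], columns=['a', 'b'])

# xarray adds named dimensions and coordinate labels for gridded data
da = xr.DataArray(values, dims=('y', 'x'), coords={'y': [10.0, 20.0], 'x': [0.5, 1.5]})

print(df.loc['row1', 'b'])           # label-based tabular lookup -> 4.0
print(da.sel(y=20.0, x=0.5).item())  # coordinate-based gridded lookup -> 3.0
```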
Likely, your approach to data access will depend not only on the intended analysis, but on features of the data, such as size and location. First, to demonstrate simple cases, we will begin by reading in local files with Pandas and xarray. Then, because many Earth science datasets are large, gridded, and stored on remote servers, we will demonstrate the combined use of the Intake and Dask libraries to fetch data efficiently.
Throughout this cookbook, we will use HoloViz visualization tools, such as hvPlot, GeoViews, and Datashader, to efficiently explore and interact with the data at different steps in the pipeline.
Prerequisites
| Concepts | Importance | Notes |
|---|---|---|
|  | Helpful |  |
|  | Necessary |  |
|  | Necessary |  |
|  | Helpful |  |
Time to learn: 20 minutes
Imports
import pandas as pd
import xarray as xr
import intake
import hvplot.intake
import hvplot.xarray
import warnings
# Ignore a warning about the format of epsg codes
warnings.simplefilter('ignore', FutureWarning)
About Landsat data
Before we dive into loading the data, let’s review some of the key points about the dataset to build our intuition as we move toward processing and analysis.
The data in this cookbook come from the Landsat program, the longest-running record of moderate-resolution multispectral imagery of the Earth's surface. The program has launched a series of satellites over several decades, designated Landsat 1-9. In this cookbook, we will use data from Landsat 5 and Landsat 8. Landsat 5 was launched in 1984 and retired in 2013, while Landsat 8 was launched in 2013 and will likely continue operating through at least mid-2023.
Source: USGS - USGS Landsat Timeline
Landsat images are available from multiple sources and in different versions that have various levels of processing applied. However, a core aspect of this data is the use of different wavelength bands to capture multiple images of the same area, together providing much more information than a single image alone. This gives us a stack of images for each region of interest, so the first three dimensions of each chunk of data are the spatial coordinates (x and y) and the wavelength band. We also have a time dimension, as we will consider a stack of images from different years to look at the change in the water level around Walker Lake in Nevada, USA.
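A hedged sketch of that layout, with tiny made-up dimension sizes rather than the real tile shapes:

```python
import numpy as np
import xarray as xr

# A stack of images: time x band x y x x, filled with placeholder values
stack = xr.DataArray(
    np.zeros((2, 3, 4, 5)),
    dims=('time', 'band', 'y', 'x'),
    coords={'time': [1988, 2017], 'band': [1, 2, 3]},
)

# Selecting one year and one band yields a single 2D image
one_image = stack.sel(time=1988, band=2)
print(one_image.dims)  # ('y', 'x')
```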
Loading small and local data with Pandas
Now that we understand a bit about the context of the data, we'll start with the simple case of loading small local tabular data, such as a couple of .csv files, into Pandas DataFrames.
First, let’s load the Landsat 5 bands:
landsat5_bands = pd.read_csv('data/landsat5_bands.csv').set_index('Band')
landsat5_bands
| Band | Description | Range (nm) |
|---|---|---|
| 1 | Blue | 450-520 |
| 2 | Green | 520-600 |
| 3 | Red | 630-690 |
| 4 | Near-Infrared | 760-900 |
| 5 | Short Wavelength Infrared 1 | 1550-1750 |
| 6 | Thermal Infrared | 10400-12500 |
| 7 | Short Wavelength Infrared 2 | 2080-2350 |
Now let’s take a look at the relevant subset of Landsat 8 bands:
landsat8_bands = pd.read_csv('data/landsat8_bands.csv').set_index('Band')
landsat8_bands
| Band | Description | Range (nm) |
|---|---|---|
| 1 | Coastal Aerosol | 435-451 |
| 2 | Blue | 452-512 |
| 3 | Green | 533-590 |
| 4 | Red | 636-673 |
| 5 | Near-Infrared | 851-879 |
| 6 | Short Wavelength Infrared 1 | 1566-1651 |
| 7 | Short Wavelength Infrared 2 | 2107-2294 |
Interestingly, we can see that the bands differ somewhat between the satellites. For example, 'Red' is Band 4 in Landsat 8 but Band 3 in Landsat 5. This will be important later on when we want to use the same bands from the different datasets.
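One way to keep this straight in code is a small lookup table mapping each satellite's band names to band numbers; this is a sketch (the numbers come from the tables above, and the helper name is hypothetical):

```python
# Band numbers for a few band names shared by both satellites (from the tables above)
BANDS = {
    5: {'Blue': 1, 'Green': 2, 'Red': 3, 'Near-Infrared': 4},
    8: {'Blue': 2, 'Green': 3, 'Red': 4, 'Near-Infrared': 5},
}

def band_number(landsat: int, name: str) -> int:
    """Return the band number for a named band on a given Landsat satellite."""
    return BANDS[landsat][name]

print(band_number(5, 'Red'), band_number(8, 'Red'))  # 3 4
```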
Loading small and local data with xarray
Although initiatives are underway to enhance cloud-based data workflows, at some point you will likely have local gridded data that you want to load and preview without much fuss. For this simple case, it makes sense to use an appropriately simple API. Using xarray and hvPlot, we can quickly preview a netCDF file.
TODO: demonstrate hvplot.explorer when released
landsat_5_crop = xr.open_dataarray('data/landsat5_crop.nc')
landsat_5_crop
<xarray.DataArray (y: 40, x: 40)>
[1600 values with dtype=float64]
Coordinates:
band int64 ...
* y (y) float64 4.297e+06 4.297e+06 4.297e+06 ... 4.292e+06 4.291e+06
* x (x) float64 3.384e+05 3.386e+05 3.387e+05 ... 3.441e+05 3.442e+05
Attributes:
transform: [ 1.500000e+02 0.000000e+00 3.323250e+05 0.000000e+00 ...
crs: +init=epsg:32611
res: [150. 150.]
is_tiled: 0
nodatavals: nan
scales: 1.0
offsets: 0.0
AREA_OR_POINT: Area
landsat_5_crop_image = landsat_5_crop.hvplot.image(geo=True, tiles='ESRI', alpha=.7)
landsat_5_crop_image
This image is just a cropped portion of a Landsat tile, solely for demonstrating xarray loading. Passing geo=True to hvPlot plots the data in a geographic coordinate system, allowing us to use the tiles option to overlay the plot on top of map tiles.
Loading data of any size and location using Intake
Earth science data is often complex, so it's a good idea to become familiar with an approach that can handle data of nearly any size and location, even if your current data files are local. For this reason, we will demonstrate a couple of different approaches using the Intake library.
First, we will start with a simpler case of reading the same local tabular file as above with Pandas. Then we will progressively cover more complex cases, such as loading multiple files and loading gridded data from a remote location with an Intake catalog.
training = intake.open_csv('./data/landsat5_bands.csv')
Importantly, the data is not loaded yet. So far we only have the metadata:
training
csv:
args:
urlpath: ./data/landsat5_bands.csv
description: ''
driver: intake.source.csv.CSVSource
metadata: {}
We can also peek at what format the data will be in when it gets loaded using .container:
training.container
'dataframe'
Now, to get better insight into the data without loading it all just yet, we can inspect it using .to_dask(). You could also use .read_chunked() to return an iterator over the chunks (or partitions) of data. Refer to the 'Reading Data' section of the Intake documentation for more information about these options.
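The chunked-iteration idea is conceptually similar to Pandas' own chunked CSV reading; here is a sketch of that pattern using a small in-memory CSV (made-up rows, not the Landsat files):

```python
import io
import pandas as pd

csv = io.StringIO("Band,Description\n1,Blue\n2,Green\n3,Red\n4,Near-Infrared\n")

# Iterate over the file two rows at a time instead of loading it all at once
n_rows = 0
for chunk in pd.read_csv(csv, chunksize=2):
    n_rows += len(chunk)  # each chunk is an ordinary DataFrame
print(n_rows)  # 4
```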
training_dd = training.to_dask()
training_dd.head()
| Band | Description | Range (nm) | |
|---|---|---|---|
| 0 | 1 | Blue | 450-520 |
| 1 | 2 | Green | 520-600 |
| 2 | 3 | Red | 630-690 |
| 3 | 4 | Near-Infrared | 760-900 |
| 4 | 5 | Short Wavelength Infrared 1 | 1550-1750 |
training_dd.info()
<class 'dask.dataframe.core.DataFrame'>
Columns: 3 entries, Band to Range (nm)
dtypes: object(2), int64(1)
Notice that running .info() now gives us much less information than when we loaded the data earlier using pd.read_csv. This is because the Dask representation of the data only loads what it needs, when it is needed for a computation. However, you can always explicitly load all the data into memory using .read() (but be careful to only run this on a dataset that will fit in your computer's memory). Our sample dataset is very small, so let's try it:
training_df = training.read()
training_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Band 7 non-null int64
1 Description 7 non-null object
2 Range (nm) 7 non-null object
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes
Loading multiple local files
Intake also lets you load and concatenate data across multiple files in one command:
training = intake.open_csv(['./data/landsat5_bands.csv', './data/landsat8_bands.csv'])
If the data file names allow, this can be more simply expressed using *:
training = intake.open_csv('./data/landsat*_bands.csv')
training_df = training.read()
training_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 0 to 6
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Band 14 non-null int64
1 Description 14 non-null object
2 Range (nm) 14 non-null object
dtypes: int64(1), object(2)
memory usage: 448.0+ bytes
But sometimes data encoded in a file name or path is lost when files are concatenated. In this example, we lose the information about which Landsat satellite each set of bands belongs to (5 vs 8). To keep track of that information, we can use a Python format string {variable:type} to specify our path and declare a new field on our data. That field will be populated based on its value in the path.
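Under the hood, Intake parses such fields out of each matched filename; here is a hedged sketch of the same idea using a regular expression (with stand-in DataFrames rather than the real CSVs):

```python
import re
import pandas as pd

paths = ['data/landsat5_bands.csv', 'data/landsat8_bands.csv']  # hypothetical paths
frames = []
for path in paths:
    # Parse the satellite number out of the filename, like the {Landsat:d} field
    landsat = int(re.search(r'landsat(\d+)_bands', path).group(1))
    df = pd.DataFrame({'Band': [1, 2]})  # stand-in for pd.read_csv(path)
    df['Landsat'] = landsat              # populate the field parsed from the path
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(sorted(combined['Landsat'].unique()))  # [5, 8]
```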
training = intake.open_csv('./data/landsat{Landsat:d}_bands.csv')
training_df = training.read().set_index(['Landsat', 'Band'])
training_df
| Landsat | Band | Description | Range (nm) |
|---|---|---|---|
| 5 | 1 | Blue | 450-520 |
| 5 | 2 | Green | 520-600 |
| 5 | 3 | Red | 630-690 |
| 5 | 4 | Near-Infrared | 760-900 |
| 5 | 5 | Short Wavelength Infrared 1 | 1550-1750 |
| 5 | 6 | Thermal Infrared | 10400-12500 |
| 5 | 7 | Short Wavelength Infrared 2 | 2080-2350 |
| 8 | 1 | Coastal Aerosol | 435-451 |
| 8 | 2 | Blue | 452-512 |
| 8 | 3 | Green | 533-590 |
| 8 | 4 | Red | 636-673 |
| 8 | 5 | Near-Infrared | 851-879 |
| 8 | 6 | Short Wavelength Infrared 1 | 1566-1651 |
| 8 | 7 | Short Wavelength Infrared 2 | 2107-2294 |
Loading remote data with a catalog
Probably the most versatile and reproducible approach to loading data with Intake is a simple Catalog file, which declares how the data should be loaded, defines some metadata, and specifies any patterns in the file path that should be captured as fields in the data. Here is an example of an entry in a catalog YAML file:
sources:
landsat_5_small:
description: Small version of Landsat 5 Surface Reflectance Level-2 Science Product.
driver: rasterio
cache:
- argkey: urlpath
regex: 'earth-data/landsat'
type: file
args:
urlpath: 's3://earth-data/landsat/small/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band{band:d}.tif'
chunks:
band: 1
x: 50
y: 50
concat_dim: band
storage_options: {'anon': True}
The urlpath can be a path to a file, a list of files, or a path with glob notation. Alternatively, the path can be written as a Python-style format string. When the urlpath is a format string (for example, containing {band:d}), the fields specified in that string will be parsed from the filenames and returned in the data.
Let’s look at what our catalog contains using open_catalog:
cat = intake.open_catalog('./data/catalog.yml')
list(cat)
['landsat_5_small',
'landsat_8_small',
'landsat_5',
'landsat_8',
'google_landsat_band',
'amazon_landsat_band',
'fluxnet_daily',
'fluxnet_metadata',
'seattle_lidar']
Here, we see a list of datasets specified by the catalog.yml file. Let's look at the description and the urlpath for one of them:
landsat_5 = cat.landsat_5_small
landsat_5.description
'Small version of Landsat 5 Surface Reflectance Level-2 Science Product.'
landsat_5.container
'xarray'
From .container we can see that ultimately the data will be an xarray object.
landsat_5.urlpath
's3://earth-data/landsat/small/LT05_L1TP_042033_19881022_20161001_01_T1_sr_band*.tif'
From the urlpath we can see that this dataset is hosted remotely on Amazon S3.
Let’s preview this dataset using .to_dask().
landsat_5 = cat.landsat_5_small
landsat_5.to_dask()
<xarray.DataArray (band: 6, y: 300, x: 300)>
dask.array<concatenate, shape=(6, 300, 300), dtype=float64, chunksize=(1, 50, 50), chunktype=numpy.ndarray>
Coordinates:
* band (band) int64 1 2 3 4 5 7
* y (y) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.264e+06 4.264e+06
* x (x) float64 3.324e+05 3.326e+05 3.327e+05 ... 3.771e+05 3.772e+05
Attributes:
transform: (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
crs: +init=epsg:32611
res: (150.0, 150.0)
is_tiled: 0
nodatavals: (nan,)
scales: (1.0,)
offsets: (0.0,)
AREA_OR_POINT: Area
From this information we can see that there are three dimensions with coordinates x, y, and band. This is satellite data where x and y are spatial earth coordinates and band refers to the wavelength range of light used to capture the satellite image.
Visualizing the data
Often, data ingestion involves quickly visualizing your raw data to get a sense that things are proceeding accordingly. You have already seen how we can easily use hvPlot for bare xarray data, but hvPlot also provides interactive plotting commands for Intake, Pandas, Dask, and GeoPandas objects. We’ll look more closely at hvPlot and its options in later tutorials.
For now, let’s just plot out our different band images as columns (using col='band'). We can also set rasterize=True to use Datashader (another HoloViz tool) to render large data into a 2D histogram, where every array cell counts the data points falling into that pixel, as set by the resolution of your screen.
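Conceptually, rasterization aggregates many data points into a fixed grid of "screen" pixels; this is a hedged NumPy sketch of that 2D-histogram idea (random points, not the Landsat data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

# Count the points falling into each cell of a coarse 20x20 "screen" grid,
# roughly what Datashader does at your actual screen resolution
counts, _, _ = np.histogram2d(x, y, bins=(20, 20))
print(counts.shape, int(counts.sum()))  # (20, 20) 10000
```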
landsat_5.hvplot.image(col='band', cmap='viridis', xaxis=False, yaxis=False, colorbar=False, rasterize=True)
Hint: Try zooming in on one plot to see that they are all linked!
If we increase the complexity a bit and consider the likely situation where you are repeatedly loading and visualizing a similar type of data using an Intake catalog, you can instead declare plotting parameters directly in the catalog!
Here is the relevant part of our catalog.yml:
metadata:
plots:
band_image:
kind: 'image'
x: 'x'
y: 'y'
groupby: 'band'
rasterize: True
width: 400
dynamic: False
landsat_5 = cat.landsat_5_small
landsat_5.to_dask()
<xarray.DataArray (band: 6, y: 300, x: 300)>
array([[[ 640., 842., 864., ..., 1250., 929., 1111.],
[ 796., 774., 707., ..., 1136., 906., 1065.],
[ 975., 707., 908., ..., 1386., 1249., 1088.],
...,
[1243., 1202., 1160., ..., 1132., 1067., 845.],
[1287., 1334., 1292., ..., 801., 934., 845.],
[1550., 1356., 1314., ..., 1309., 1636., 1199.]],
[[ 810., 1096., 1191., ..., 1749., 1266., 1411.],
[1096., 1048., 905., ..., 1556., 1217., 1411.],
[1286., 1000., 1286., ..., 1749., 1604., 1411.],
...,
[1752., 1565., 1566., ..., 1502., 1456., 1078.],
[1752., 1799., 1706., ..., 983., 1172., 1077.],
[1893., 1753., 1754., ..., 1736., 2250., 1736.]],
[[1007., 1345., 1471., ..., 2102., 1462., 1719.],
[1260., 1259., 1175., ..., 1847., 1419., 1719.],
[1555., 1175., 1555., ..., 2059., 1889., 1760.],
...,
...
...,
[2429., 2138., 2041., ..., 2175., 1885., 1301.],
[2381., 2333., 2382., ..., 1204., 1495., 1301.],
[2478., 2041., 2333., ..., 2755., 3431., 2223.]],
[[1819., 2596., 2495., ..., 2429., 1785., 2023.],
[2259., 2359., 1885., ..., 2158., 1684., 1921.],
[2865., 2291., 2664., ..., 2057., 1955., 2057.],
...,
[3081., 2679., 2612., ..., 2499., 2098., 1395.],
[2779., 2544., 2779., ..., 1429., 1596., 1496.],
[3183., 2309., 2679., ..., 3067., 3802., 2665.]],
[[1682., 2215., 2070., ..., 2072., 1440., 1780.],
[1876., 1973., 1633., ..., 1926., 1586., 1635.],
[2409., 1924., 2215., ..., 1829., 1780., 1829.],
...,
[2585., 2296., 2296., ..., 2093., 1710., 1134.],
[2393., 2344., 2489., ..., 1182., 1374., 1326.],
[2826., 2007., 2393., ..., 2860., 3724., 2333.]]])
Coordinates:
* band (band) int64 1 2 3 4 5 7
* y (y) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.264e+06 4.264e+06
* x (x) float64 3.324e+05 3.326e+05 3.327e+05 ... 3.771e+05 3.772e+05
Attributes:
transform: (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
crs: +init=epsg:32611
res: (150.0, 150.0)
is_tiled: 0
nodatavals: (nan,)
scales: (1.0,)
offsets: (0.0,)
AREA_OR_POINT: Area
landsat_5.hvplot.band_image()
Excellent! Since we didn't specify that the images should be plotted as columns, hvPlot conveniently gives us a widget to scrub through them.
Side-note: Drilling down for data access
As part of accessing data, it's important to understand how to drill down to the different representations of our data that certain processing steps might require. As we saw earlier, Dask only loads the data into memory when it really needs it for a computation or when you explicitly ask, so let's go ahead and ask.
Xarray DataArray
If the data are small enough, we can use the .read() method to read everything into a regular xarray.DataArray, a single in-memory container. Once in an xarray object, the data can be more easily manipulated and visualized.
landsat_5_xda = landsat_5.read()
type(landsat_5_xda)
xarray.core.dataarray.DataArray
landsat_5_xda.head(3)
<xarray.DataArray (band: 3, y: 3, x: 3)>
array([[[ 640., 842., 864.],
[ 796., 774., 707.],
[ 975., 707., 908.]],
[[ 810., 1096., 1191.],
[1096., 1048., 905.],
[1286., 1000., 1286.]],
[[1007., 1345., 1471.],
[1260., 1259., 1175.],
[1555., 1175., 1555.]]])
Coordinates:
* band (band) int64 1 2 3
* y (y) float64 4.309e+06 4.309e+06 4.309e+06
* x (x) float64 3.324e+05 3.326e+05 3.327e+05
Attributes:
transform: (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
crs: +init=epsg:32611
res: (150.0, 150.0)
is_tiled: 0
nodatavals: (nan,)
scales: (1.0,)
offsets: (0.0,)
AREA_OR_POINT: Area
NumPy Array
Under the hood, this xarray object is essentially composed of NumPy arrays with the addition of coordinate labels (and other superpowers). Some machine learning tools (e.g. scikit-learn) accept NumPy arrays as input. These arrays are accessible from xarray DataArray objects via the .values attribute.
landsat_5_npa = landsat_5_xda.values
type(landsat_5_npa)
numpy.ndarray
landsat_5_npa.shape
(6, 300, 300)
landsat_5_npa[:3, :3, :3]
array([[[ 640., 842., 864.],
[ 796., 774., 707.],
[ 975., 707., 908.]],
[[ 810., 1096., 1191.],
[1096., 1048., 905.],
[1286., 1000., 1286.]],
[[1007., 1345., 1471.],
[1260., 1259., 1175.],
[1555., 1175., 1555.]]])
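Many scikit-learn estimators expect a 2D (n_samples, n_features) array, so a (band, y, x) cube is often reshaped so that each pixel becomes a sample and each band a feature. This is a sketch with made-up dimensions (a stand-in array, not the real landsat_5_npa):

```python
import numpy as np

cube = np.arange(6 * 4 * 4, dtype=float).reshape(6, 4, 4)  # stand-in for (band, y, x) data

# Flatten the spatial dimensions, then transpose: one row per pixel, one column per band
X = cube.reshape(cube.shape[0], -1).T
print(X.shape)  # (16, 6)
```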
Summary
Before accessing any data, it's a good idea to start by learning about the context and details of the dataset. This will give you the intuition to make informed decisions as you form a processing and analysis pipeline. Once you have identified a dataset to use, tailor the data access approach to the features of the data, such as size and location. If your data is local and small, a reliable approach is to access the data with either Pandas or xarray, depending on whether the data is tabular or gridded. As Earth science data is often large, multidimensional, and on a remote server, a powerful and increasingly common approach is to use the combination of Intake, Dask, and xarray. Once you have accessed data, visualize it with hvPlot to ensure that it matches your expectations.
Resources and references
Authored/adapted by Demetris Roumis circa Dec, 2022
This cookbook was inspired by the EarthML tutorial. See a list of the EarthML contributors here.
The Landsat timeline image is originally from USGS but discovered through earthsciencedata.org
The Landsat 8 banner image is from NASA